PERF: DataFrame.values for pyarrow-backed numeric types #52348

Closed
wants to merge 1 commit into from

lukemanley
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Perf improvement for DataFrame.values when the frame is backed by a single pyarrow numeric dtype and contains no nulls. I realize this is a narrow use case, so I'm happy to close this PR if it isn't worth special-casing. The current slowness comes from DataFrame.values always casting to object dtype for EA-backed (ExtensionArray-backed) frames. Unfortunately, a single null anywhere in the DataFrame defeats this optimization, since pd.NA is used as the null representation in the resulting ndarray.

import pandas as pd
import numpy as np

data = np.random.randn(100_000, 20)
df = pd.DataFrame(data, dtype="float64[pyarrow]")

%timeit df.values

# 98.7 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <- main
# 3.56 ms ± 96.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

@lukemanley added the Performance (Memory or execution speed performance) and Arrow (pyarrow functionality) labels on Apr 1, 2023
@phofl
Member

phofl commented Apr 2, 2023

This also changes behavior (e.g. getting float instead of object). Personally, I think this is fine but we have an unresolved discussion about this somewhere. We should decide there first before special casing here I'd say

@lukemanley
Member Author

> This also changes behavior (e.g. getting float instead of object).

Yes, the performance improvement comes from avoiding the cast to object. Note that this behavior already exists on main for a DataFrame with a single column:

Behavior on main:

import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, dtype="float64[pyarrow]")
df2 = pd.DataFrame({"a": [1.0, 2.0, pd.NA]}, dtype="float64[pyarrow]")

print(df1.values.dtype)  # float64
print(df2.values.dtype)  # object

> Personally, I think this is fine but we have an unresolved discussion about this somewhere. We should decide there first before special casing here I'd say

Sure, I think you might be referring to #22791

@lukemanley
Member Author

Closing for now, pending further discussion in #22791.

@lukemanley lukemanley closed this Apr 9, 2023
@lukemanley lukemanley deleted the perf-df-values-arrow branch April 18, 2023 11:03